Columns: The dataset contains 15 columns. The 13 columns used in the analysis are described below.
age:
the age of an individual (Integer greater than 0)
workclass:
a general term to represent the employment status of an individual
values: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt:
final weight. In other words, this is the number of people the census believes the entry represents.
values: Integer greater than 0.
education:
the highest level of education achieved by an individual.
values: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc,
9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
educational-num:
the highest level of education achieved in numerical form. Integer greater than 0
marital-status:
marital status of an individual. Married-civ-spouse corresponds to a
civilian spouse while Married-AF-spouse is a spouse in the Armed Forces.
values: Married-civ-spouse, Divorced, Never-married, Separated, Widowed,
Married-spouse-absent, Married-AF-spouse.
occupation:
the general type of occupation of an individual
values: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial,
Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical,
Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv,
Armed-Forces.
relationship:
represents what this individual is relative to others. For example, an
individual could be a Husband. Each entry has only one relationship attribute, which is
somewhat redundant with marital status. We might not make use of this attribute at all.
values: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race:
Descriptions of an individual’s race
values: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
gender:
the biological sex of the individual
values: Male, Female.
hours-per-week:
the hours an individual has reported to work per week, continuous.
native-country:
country of origin for an individual
values: United-States, Cambodia, England, Puerto-Rico, Canada, Germany,
Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran,
Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal,
Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia,
Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador,
Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
income (the label):
whether or not an individual makes more than 50,000 dollars annually. Values: <=50k, >50k
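As a rough illustration of how fnlwgt can be read as a weight, the sketch below compares an unweighted share with a share weighted by fnlwgt. The rows, the helper name weighted_share, and the numbers are hypothetical and are not taken from adult.csv; plain Python is used so the snippet runs on its own.

```python
# Hypothetical rows (fnlwgt, income), invented for illustration only.
toy_rows = [(100, ">50K"), (300, "<=50K"), (600, "<=50K")]


def weighted_share(rows, label):
    """Fraction of the represented population whose income equals label."""
    total_weight = sum(weight for weight, _ in rows)
    label_weight = sum(weight for weight, income in rows if income == label)
    return label_weight / total_weight


# Each row counts once.
unweighted = sum(1 for _, income in toy_rows if income == ">50K") / len(toy_rows)
# Each row counts as fnlwgt people.
weighted = weighted_share(toy_rows, ">50K")
print(f"unweighted: {unweighted:.3f}, weighted: {weighted:.3f}")
```

In this toy case one row in three is labelled >50K, but that row stands for only 100 of the 1000 represented people, so the weighted share is lower than the unweighted one.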
We decided to drop two columns: capital-gain and capital-loss. While investigating the data, we saw that the histograms of these variables are very dense (most records have the same value). Because nearly all records share a single value, these variables do not explain the data well and we cannot draw new information about the data from them. Graphs are provided later on.
Number of records in the dataset: 48842
Link to the data set: https://www.kaggle.com/wenruliu/adult-income-dataset?select=adult.csv
import math
import pandas as pd
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly
plotly.offline.init_notebook_mode()
adult_incomes = pd.read_csv("adult.csv").drop(["capital-gain", "capital-loss"], axis=1).replace("?", "Other")
num_records = len(adult_incomes)
numeric_features = {"age":"Years", "fnlwgt":"Number Of People", "educational-num":"Years", "hours-per-week":"Hours"}
def outliers_analysis(feature_values, feature_values_summary):
    """
    Outlier analysis of numeric feature values using the 1.5 * IQR rule.
    Values equal to a fence are counted as outliers.
    feature_values: Pandas Series, values of a specific feature
    feature_values_summary: Pandas Series, summary of feature_values (output of describe())
    return tuple of (number of outliers at or below the lower fence,
                     number of outliers at or above the upper fence)
    """
    Q1 = feature_values_summary.get("25%")
    Q3 = feature_values_summary.get("75%")
    IQR = Q3 - Q1
    upper_fence = Q3 + 1.5 * IQR
    lower_fence = Q1 - 1.5 * IQR
    # Summing a boolean mask counts the matching records (0 when none match).
    num_high_outliers = int((feature_values >= upper_fence).sum())
    num_low_outliers = int((feature_values <= lower_fence).sum())
    return num_low_outliers, num_high_outliers


def numeric_feature_analysis(pd_df, feature, value):
    """
    Show the histogram and box plot of a feature from pd_df, then print its summary statistics.
    pd_df: Pandas DataFrame
    feature: String, name of feature column in pd_df
    value: String, unit of the feature values (used as the x-axis title)
    """
    feature_values = pd_df[feature]
    fig = make_subplots(rows=1, cols=2)
    fig.add_trace(
        go.Histogram(x=feature_values, name="Histogram"),
        row=1, col=1
    )
    fig.add_trace(
        go.Box(x=feature_values, name="BoxPlot"),
        row=1, col=2
    )
    fig.update_layout(height=400, width=900,
                      title_text=f"Histogram, boxplot & statistical measures of {feature}",
                      xaxis_title=value, yaxis_title="Count",
                      colorway=['rgb(188, 128, 189)', 'rgb(68, 170, 153)'])
    fig.show()
    feature_values_summary = feature_values.describe()
    display(feature_values_summary)  # display() is provided by the notebook environment
    # outliers
    num_low_outliers, num_high_outliers = outliers_analysis(feature_values, feature_values_summary)
    # The ".2%" format multiplies the ratio by 100 and appends the percent sign.
    print(f"Number of 'low' outliers: {num_low_outliers} "
          f"({num_low_outliers / num_records:.2%} of total records)")
    print(f"Number of 'high' outliers: {num_high_outliers} "
          f"({num_high_outliers / num_records:.2%} of total records)")
    # missing values
    num_of_nans = feature_values.isna().sum()
    print(f"Number of missing values: {num_of_nans} "
          f"({num_of_nans / num_records:.2%} of total records)")


for feature, value in numeric_features.items():
    numeric_feature_analysis(adult_incomes, feature, value)
count    48842.000000
mean        38.643585
std         13.710510
min         17.000000
25%         28.000000
50%         37.000000
75%         48.000000
max         90.000000
Name: age, dtype: float64
Number of 'low' outliers: 0 (0.00% of total records)
Number of 'high' outliers: 250 (0.51% of total records)
Number of missing values: 0 (0.00% of total records)
count    4.884200e+04
mean     1.896641e+05
std      1.056040e+05
min      1.228500e+04
25%      1.175505e+05
50%      1.781445e+05
75%      2.376420e+05
max      1.490400e+06
Name: fnlwgt, dtype: float64
Number of 'low' outliers: 0 (0.00% of total records)
Number of 'high' outliers: 1453 (2.97% of total records)
Number of missing values: 0 (0.00% of total records)
count    48842.000000
mean        10.078089
std          2.570973
min          1.000000
25%          9.000000
50%         10.000000
75%         12.000000
max         16.000000
Name: educational-num, dtype: float64
Number of 'low' outliers: 1794 (3.67% of total records)
Number of 'high' outliers: 0 (0.00% of total records)
Number of missing values: 0 (0.00% of total records)
count    48842.000000
mean        40.422382
std         12.391444
min          1.000000
25%         40.000000
50%         40.000000
75%         45.000000
max         99.000000
Name: hours-per-week, dtype: float64
Number of 'low' outliers: 8286 (16.96% of total records)
Number of 'high' outliers: 5210 (10.67% of total records)
Number of missing values: 0 (0.00% of total records)
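The fences behind these outlier counts can be checked by hand. The sketch below is a minimal plain-Python version of the 1.5 * IQR rule with inclusive comparisons, as used in the analysis code. The helper names iqr_fences and count_outliers and the toy value lists are hypothetical; only the age quartiles (28 and 48) come from the summary above.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower_fence, upper_fence) for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def count_outliers(values, q1, q3, k=1.5):
    """Count values at or beyond each fence (inclusive comparisons)."""
    lower, upper = iqr_fences(q1, q3, k)
    low = sum(1 for v in values if v <= lower)
    high = sum(1 for v in values if v >= upper)
    return low, high


# Quartiles of age reported by describe(): 25% = 28, 75% = 48.
print(iqr_fences(28, 48))                      # (-2.0, 78.0)

# Hypothetical ages, invented for illustration only.
toy_ages = [17, 25, 37, 48, 78, 90]
print(count_outliers(toy_ages, 28, 48))        # (0, 2)

# When Q1 equals Q3 the two fences coincide, so with inclusive
# comparisons every value sits on or beyond at least one fence.
print(count_outliers([0, 0, 0, 5], 0, 0))      # (3, 4)
```

The last call shows why a feature whose quartiles are all equal can report every record as an outlier under this rule: the count reflects the collapsed fences rather than unusual values.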
categorical_features = ["workclass", "education", "marital-status", "occupation", "relationship", "race", "gender",
                        "native-country", "income"]


def categorical_feature_analysis(pd_df, feature):
    """
    Show the histogram of a categorical feature from pd_df.
    pd_df: Pandas DataFrame
    feature: String, name of feature column in pd_df
    """
    fig = px.histogram(pd_df, x=feature, color_discrete_sequence=['#D62728'])
    fig.update_layout(height=400, width=900,
                      title_text=f"Histogram of {feature}")
    fig.show()


for feature in categorical_features:
    categorical_feature_analysis(adult_incomes, feature)
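The histograms show category frequencies graphically; a numeric table of the same frequencies can be easier to quote. The sketch below is one way to build such a table in plain Python. The helper name category_shares and the toy values are hypothetical; with pandas, Series.value_counts(normalize=True) gives comparable shares.

```python
from collections import Counter


def category_shares(values):
    """Return (category, count, share) tuples, most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(cat, n, n / total) for cat, n in counts.most_common()]


# Hypothetical column values, invented for illustration only.
toy_workclass = ["Private", "Private", "Private", "State-gov", "Other"]
for cat, n, share in category_shares(toy_workclass):
    print(f"{cat:<10} {n:>3} {share:.1%}")
```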
dropped_features_data = pd.read_csv("adult.csv")[["capital-gain", "capital-loss"]]
dropped_features_list = dropped_features_data.columns
for feature in dropped_features_list:
    numeric_feature_analysis(dropped_features_data, feature, "dollars")
count    48842.000000
mean      1079.067626
std       7452.019058
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max      99999.000000
Name: capital-gain, dtype: float64
Number of 'low' outliers: 44807 (91.74% of total records)
Number of 'high' outliers: 48842 (100.00% of total records)
Number of missing values: 0 (0.00% of total records)
count    48842.000000
mean        87.502314
std        403.004552
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max       4356.000000
Name: capital-loss, dtype: float64
Number of 'low' outliers: 46560 (95.33% of total records)
Number of 'high' outliers: 48842 (100.00% of total records)
Number of missing values: 0 (0.00% of total records)
As we can see from the graphs, these variables have very dense values, most of which are zeros: about 92 % of the capital-gain values (44807 of 48842) and about 95 % of the capital-loss values (46560 of 48842) are zero. Therefore, as we explained in the first paragraph, we dropped these features from our data.
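The density argument can also be expressed as a single number: the share of records that hold the most common value. The sketch below is a minimal plain-Python illustration; the helper name dominant_value_share and the toy column are hypothetical and are not taken from adult.csv.

```python
from collections import Counter


def dominant_value_share(values):
    """Return (most frequent value, fraction of records that hold it)."""
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)


# Hypothetical capital-gain style column: 92 zeros and 8 non-zero values.
toy_capital_gain = [0] * 92 + [2174, 5013, 7298, 7688, 14084, 15024, 99999, 3103]
value, share = dominant_value_share(toy_capital_gain)
print(f"most common value: {value}, share: {share:.1%}")
```

A column where this share is close to 1 carries little information per record, which is the reasoning applied to capital-gain and capital-loss here.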